AITopics | malware detection

Collaborating Authors

malware detection

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

EMBERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis Dragos

Neural Information Processing SystemsFeb-11-2026, 22:57:28 GMT

Moreover, we observe that the focus in the few related works falls on quantifying similarity in malware, often overlooking the clean data. This one-sided quantification is especially dangerous in the context of detection bypass.

data mining, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
(4 more...)

Add feedback

RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers via Randomized Deletion

Neural Information Processing SystemsFeb-10-2026, 14:38:45 GMT

We present a case study on malware detection--a binary classification problem on byte sequences where classifier evasion is a well-established threat model.

classifier, data mining, machine learning, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Oceania > Australia (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
(2 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology > Security & Privacy (1.00)
Government (0.93)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(4 more...)

Add feedback

9fd98f856d3ca2086168f264a117ed7c-Paper.pdf

Neural Information Processing SystemsFeb-10-2026, 07:45:21 GMT

non-uniform perturbation, perturbation, robustness, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Santa Clara (0.04)
North America > United States > California > Los Angeles County > Santa Monica (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > France (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Communications > Social Media (0.93)
Information Technology > Data Science > Data Mining (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Evaluating the robustness of adversarial defenses in malware detection systems

Jafari, Mostafa, Shameli-Sendi, Alireza

arXiv.org Artificial IntelligenceDec-9-2025

Machine learning is a key tool for Android malware detection, effectively identifying malicious patterns in apps. However, ML-based detectors are vulnerable to evasion attacks, where small, crafted changes bypass detection. Despite progress in adversarial defenses, the lack of comprehensive evaluation frameworks in binary-constrained domains limits understanding of their robustness. We introduce two key contributions. First, Prioritized Binary Rounding, a technique to convert continuous perturbations into binary feature spaces while preserving high attack success and low perturbation size. Second, the sigma-binary attack, a novel adversarial method for binary domains, designed to achieve attack goals with minimal feature changes. Experiments on the Malscan dataset show that sigma-binary outperforms existing attacks and exposes key vulnerabilities in state-of-the-art defenses. Defenses equipped with adversary detectors, such as KDE, DLA, DNN+, and ICNN, exhibit significant brittleness, with attack success rates exceeding 90% using fewer than 10 feature modifications and reaching 100% with just 20. Adversarially trained defenses, including AT-rFGSM-k, AT-MaxMA, improves robustness under small budgets but remains vulnerable to unrestricted perturbations, with attack success rates of 99.45% and 96.62%, respectively. Although PAD-SMA demonstrates strong robustness against state-of-the-art gradient-based adversarial attacks by maintaining an attack success rate below 16.55%, the sigma-binary attack significantly outperforms these methods, achieving a 94.56% success rate under unrestricted perturbations. These findings highlight the critical need for precise method like sigma-binary to expose hidden vulnerabilities in existing defenses and support the development of more resilient malware detection systems.

artificial intelligence, machine learning, optimization problem, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.compeleceng.2025.110845

2505.09342

Country:

Europe (0.45)
North America > Canada (0.28)
Asia > Middle East (0.28)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(2 more...)

Add feedback

Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection

Bommarito, Michael J. II

arXiv.org Artificial IntelligenceDec-1-2025

Deep learning research for binary analysis faces a critical infrastructure gap. Today, existing datasets target single platforms, require specialized tooling, or provide only hand-engineered features incompatible with modern neural architectures; no single dataset supports accessible research and pedagogy on realistic use cases. To solve this, we introduce Binary-30K, the first heterogeneous binary dataset designed for sequence-based models like transformers. Critically, Binary-30K covers Windows, Linux, macOS, and Android across 15+ CPU architectures. With 29,793 binaries and approximately 26.93% malware representation, Binary-30K enables research on platform-invariant detection, cross-target transfer learning, and long-context binary understanding. The dataset provides pre-computed byte-level BPE tokenization alongside comprehensive structural metadata, supporting both sequence modeling and structure-aware approaches. Platform-first stratified sampling ensures representative coverage across operating systems and architectures, while distribution via Hugging Face with official train/validation/test splits enables reproducible benchmarking. The dataset is publicly available at https://huggingface.co/datasets/mjbommar/binary-30k, providing an accessible resource for researchers, practitioners, and students alike.

artificial intelligence, deep learning, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2511.22095

Genre:

Research Report (1.00)
Instructional Material (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Accuracy and Efficiency Trade-Offs in LLM-Based Malware Detection and Explanation: A Comparative Study of Parameter Tuning vs. Full Fine-Tuning

Gravereaux, Stephen C., Islam, Sheikh Rabiul

arXiv.org Artificial IntelligenceNov-26-2025

Abstract--This study examines whether Low-Rank Adaptation (LoRA) fine-tuned Large Language Models (LLMs) can approximate the performance of fully fine-tuned models in generating human-interpretable decisions and explanations for malware classification. Achieving trustworthy malware detection, particularly when LLMs are involved, remains a significant challenge. We developed an evaluation framework using Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and Semantic Similarity Metrics to benchmark explanation quality across five LoRA configurations and a fully fine-tuned baseline. Results indicate that full fine-tuning achieves the highest overall scores, with BLEU and ROUGE improvements of up to 10% over LoRA variants. However, mid-range LoRA models deliver competitive performance--exceeding full fine-tuning on two metrics--while reducing model size by approximately 81% and training time by over 80% on a LoRA model with 15.5% trainable parameters. These findings demonstrate that LoRA offers a practical balance of interpretability and resource efficiency, enabling deployment in resource-constrained environments without sacrificing explanation quality. By providing feature-driven natural language explanations for malware classifications, this approach enhances transparency, analyst confidence, and operational scalability in malware detection systems. Modern AI-based malware detection systems often lack trustworthiness, particularly when LLMs are involved, limiting analysts' ability to validate automated decisions and improve detection strategies.

explanation, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2511.19654

Country: North America > United States (0.15)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Synthetic Data: AI's New Weapon Against Android Malware

Nogueira, Angelo Gaspar Diniz, Paim, Kayua Oleques, Bragança, Hendrio, Mansilha, Rodrigo Brandão, Kreutz, Diego

arXiv.org Artificial IntelligenceNov-26-2025

The ever-increasing number of Android devices and the accelerated evolution of malware, reaching over 35 million samples by 2024, highlight the critical importance of effective detection methods. Attackers are now using Artificial Intelligence to create sophisticated malware variations that can easily evade traditional detection techniques. Although machine learning has shown promise in malware classification, its success relies heavily on the availability of up-to-date, high-quality datasets. The scarcity and high cost of obtaining and labeling real malware samples presents significant challenges in developing robust detection models. In this paper, we propose MalSynGen, a Malware Synthetic Data Generation methodology that uses a conditional Generative Adversarial Network (cGAN) to generate synthetic tabular data. This data preserves the statistical properties of real-world data and improves the performance of Android malware classifiers. We evaluated the effectiveness of this approach using various datasets and metrics that assess the fidelity of the generated data, its utility in classification, and the computational efficiency of the process. Our experiments demonstrate that MalSynGen can generalize across different datasets, providing a viable solution to address the issues of obsolescence and low quality data in malware detection. With approximately 3 billion Android devices in operation worldwide [1], the mobile cybersecurity landscape faces formidable challenges. In 2024 alone, Kaspersky reported over 33.3 million cyberattacks targeting smartphone users globally, encompassing diverse forms of malware and unwanted software [2]. Adding to this problem, attackers are using Artificial Intelligence (AI) to rapidly generate new malware variants by exploiting patterns learned from existing malware [3].

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2511.19649

Country: South America > Brazil > Rio Grande do Sul (0.14)

Genre: Research Report > New Finding (0.94)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.68)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback

Reducing Instability in Synthetic Data Evaluation with a Super-Metric in MalDataGen

da Silva, Anna Luiza Gomes, Kreutz, Diego, Diniz, Angelo, Mansilha, Rodrigo, da Fonseca, Celso Nobre

arXiv.org Artificial IntelligenceNov-21-2025

Evaluating the quality of synthetic data remains a persistent challenge in the Android malware domain due to instability and the lack of standardization among existing metrics. Experiments involving ten generative models and five balanced datasets demonstrate that the Super-Metric is more stable and consistent than traditional metrics, exhibiting stronger correlations with the actual performance of classifiers. Synthetic data generation has become an increasingly relevant strategy in cybersecurity [1], [2], [3], particularly as a way to mitigate the scarcity of real, complete, and high-quality datasets that limit the performance and generalization of machine learning models. Despite these advances, assessing the quality of synthetic data remains a complex and largely non-standardized methodological challenge [4], with no clear consensus on which metrics should be used or how to combine them consistently. The literature reports a significant fragmentation in the application of fidelity metrics, with studies identifying more than 65 distinct indicators used independently to assess fidelity [5]. This diversity hinders model-to-model comparison, reduces experimental reproducibility, and complicates the integrated interpretation of data quality.

artificial intelligence, machine learning, super-metric, (16 more...)

arXiv.org Artificial Intelligence

2511.16373

Country: South America > Brazil (0.15)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.79)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research

Apostu, Alexandru-Mihai, Preda, Andrei, Damir, Alexandra Daniela, Bolocan, Diana, Ionescu, Radu Tudor, Croitoru, Ioana, Gaman, Mihaela

arXiv.org Artificial IntelligenceNov-18-2025

Generating thorough natural language explanations for threat detections remains an open problem in cybersecurity research, despite significant advances in automated malware detection systems. In this work, we present AutoMalDesc, an automated static analysis summarization framework that, following initial training on a small set of expert-curated examples, operates independently at scale. This approach leverages an iterative self-paced learning pipeline to progressively enhance output quality through synthetic data generation and validation cycles, eliminating the need for extensive manual data annotation. Evaluation across 3,600 diverse samples in five scripting languages demonstrates statistically significant improvements between iterations, showing consistent gains in both summary quality and classification accuracy. Our comprehensive validation approach combines quantitative metrics based on established malware labels with qualitative assessment from both human experts and LLM-based judges, confirming both technical precision and linguistic coherence of generated summaries. To facilitate reproducibility and advance research in this domain, we publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) and test (3.6K)

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.13333

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.34)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)

Add feedback

Retrofit: Continual Learning with Bounded Forgetting for Security Applications

He, Yiling, Lei, Junchi, She, Hongyu, Shao, Shuo, Zheng, Xinran, Liu, Yiping, Qin, Zhan, Cavallaro, Lorenzo

arXiv.org Artificial IntelligenceNov-17-2025

Modern security analytics are increasingly powered by deep learning models, but their performance often degrades as threat landscapes evolve and data representations shift. While continual learning (CL) offers a promising paradigm to maintain model effectiveness, many approaches rely on full retraining or data replay, which are infeasible in data-sensitive environments. Moreover, existing methods remain inadequate for security-critical scenarios, facing two coupled challenges in knowledge transfer: preserving prior knowledge without old data and integrating new knowledge with minimal interference. We propose RETROFIT, a data retrospective-free continual learning method that achieves bounded forgetting for effective knowledge transfer. Our key idea is to consolidate previously trained and newly fine-tuned models, serving as teachers of old and new knowledge, through parameter-level merging that eliminates the need for historical data. To mitigate interference, we apply low-rank and sparse updates that confine parameter changes to independent subspaces, while a knowledge arbitration dynamically balances the teacher contributions guided by model confidence. Our evaluation on two representative applications demonstrates that RETROFIT consistently mitigates forgetting while maintaining adaptability. In malware detection under temporal drift, it substantially improves the retention score, from 20.2% to 38.6% over CL baselines, and exceeds the oracle upper bound on new data. In binary summarization across decompilation levels, where analyzing stripped binaries is especially challenging, RETROFIT achieves around twice the BLEU score of transfer learning used in prior work and surpasses all baselines in cross-representation generalization.

artificial intelligence, knowledge, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2511.11439

Genre: Research Report > New Finding (0.45)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback